Python Logo

Introduction to Python with Application in Bioinformatics¶

Nanjiang Shu¶

2024-07-16 (Day 2)¶

Review of Day 1¶

  • Liternals and their associated data types, variables, and basic operations
    • Four basic data types: Numeric, String, Boolean, None
    • Collection of data types: List, Tuple, Set and Dictionary
    • Naming of variables, rules and conventions
  • Built-in functions and operations
    • print, type, len, max, sum
  • Loops
    • for loop and while loop
  • How to use if/elif/else statements to control the logic of the code

Review of the quiz from yesterday¶

  • Link to the quiz statistics

Questions?¶

Day 2¶

  • Session 1:
    • Python Script: Dump your code into a script
    • File I/O: Read and write files
  • Session 2:
    • Variants identification with VCF (Exercise after lunch)
  • Session 3:
    • Introduction to the Course Project
  • Session 4:
    • Team building - Spaghetti tower

Session 1: Python script, File I/O¶

Python script¶

A Python script is a file containing Python code that can be executed all at once.

  • File Extension: .py
In [ ]:
genotype = "AG"
phenotype = "expressed"
if genotype == "AG":
    # Only check phenotype if genotype is "AG"
    if phenotype == "expressed":
        print("Variant is active and expressed.")
    else:
        print("Variant is inactive.")
else:
    print("Non-target variant.")

Now we will save the code in a text file and try to run it in the terminal¶

About the file extension .py¶

  • Actually a Python script can have any extention or even without an extension.
  • We use it as a convention
  • Other benefits of using the .py entension - people and the OS can recognize it.

5 mins exercise¶

Put some of your Python code you have written yesterday and save it in a Python script, run it in a terminal with either


python yourscript.py

or

./yourscript.py (not required for the Windows user)

  • Replace yourscript.py with the actual name you use
  • You need to add #!/usr/bin/env python at the beginning of the script and also make the script executable in order to run it with the second method (only for Mac or Linux users)

File I/O - read and write files¶

Read from file¶


file = open("filename.txt", "r")
content = file.read()
file.close()

'r' specifies the mode for opening the file as a read-only text file. If the file does not exist, a FileNotFoundError will be raised.

Other types of mode:

  • 'rb': Opens the file as a binary file for reading.
  • 'r+ : Opens the file for reading and writing.

Extra reading: https://www.w3schools.com/python/python_file_handling.asp

Use the with keyword to ensure the file handle be closed automatically¶

file = open("filename.txt", "r")
content = file.read()
file.close()

with open("filename.txt", "r") as file:
    content = file.read()

You may also specify encoding='utf-8' when opening a file in Python to ensure proper handling of text files that contain non-ASCII characters, such as Chinese characters.¶

with open ("filename.txt", "r", encoding='utf-8') as file
    content = file.read()

However, for Python 3, the default encoding is usually 'utf-8', so it's not needed.¶

I have a file contains DNA sequences and want to use Python to read its content and then calculate the number of nucleotides, that is, get the length of the dna sequence.

In [ ]:
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    for line in file:
        print(line.strip())

Detour: useful methods for string¶

'string'.strip()       Removes whitespace
'string'.split()       Splits on whitespace into list

In [ ]:
string1  = '  an example string with whitespaces at both ends   '
string2 = string1.strip()
string2
In [ ]:
list1 = string2.split()
list1
In [ ]:
list2 = string1.strip().split()
list2

Write to file¶


file = open("output.txt", "w")
file.write("Hello, Python!")
file.close()

  • 'w': Opens a file for writing only. If the file does not exist, it creates a new file. If the file exists, its previous content will be truncated, effectively deleting the content before the new write operations.

Other modes:

  • 'a': Opens the file for writing, appending any new data you write to the end of the file's existing content, thus preserving the previous content.

Extra reading: https://www.w3schools.com/python/python_file_handling.asp

Similar to file reading, you can also use the with keyword¶

In [ ]:
with open("../newproject/output.txt", "w") as file:
    file.write("Hello, Python!")
In [ ]:
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    seqlength = 0
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            seqlength += len(line)

outfile = "../newproject/length_of_dna_sequence.txt"
with open(outfile, "w") as file:
    file.write("Length of DNA sequence: " + str(seqlength))

Day 2, Exercise 1¶

Link: https://python-bioinfo.bioshu.se/exercises.html


Take a break after the exercise¶

Session 2: Variants identification with VCF¶

  • A VCF (Variant Call Format) file is a text file format used in bioinformatics for storing gene sequence variations.

Description of the task¶

You have a VCF file with a number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1,200,000-1,300,000, including the start and end positions). What you want to know is whether this sample has any variants in this region, and if so, which variants.

What is your input?¶

Drawing

  1. CHROM: 1 (Chromosome 1)
  2. POS: 10177 (Position 10177 on the chromosome)
  3. ID: rs367896725 (Reference SNP ID)
  4. REF: A (Reference allele is A)
  5. ALT: AC (Alternative allele is AC)
  6. QUAL: 50 (Quality score)
  7. FILTER: PASS (status, meaning passed all quality control)
  8. FORMAT: GT:DP:CB (Genotype, Depth, Cell Barcode)
  9. SAMPLES: 0/1:30:SM (Sample genotypes and additional info)

Genotype field¶

Drawing

  • 0/1
    • For human, and many other organisms, there are two copies of genetic information, inherited from both parents.
    • 0 means DNA at this position is the same as reference allele (A),
    • 1 means has alternative allele, in this case AC

The notation of alternative allele is not always 1

  • 0/2
    • 2 means it is the second alternative allele, i.e. C in this case

Start with pseudocode!¶

  • Pseudocode is a description of what you want to do without actually using proper syntax

Task:¶

You have a VCF file with a number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1,200,000-1,300,000, including the start and end positions). What you want to know is whether this sample has any variants in this region, and if so, which variants.

Drawing

  • Open the file and loop over lines (ignore lines with started with #)
  • Identify lines where chromosome is equal to 5 and position is between 1,200,000 and 1,300,000
  • Isolate the column that contains the genotype for sample1, e.g. 0/1:30:SM
  • Extract only the genotypes (e.g. 0/1) from the column
  • Check if the genotype contains any alternate alleles, i.e., if not 0/0.
  • Print any variants containing alternate alleles for this sample between specified region

Open the file and loop over lines (ignore lines starting with #), and print the first record¶

Drawing

In [ ]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            print(line)
            break
# Next, find chromosome 5

Identify lines where chromosome is 5¶

Drawing

In [ ]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            if chrom == '5':
                print(cols)
                break

# Next, find the correct region

Identify lines where chromosome is 5 and position is between 1,200,000 and 1,300,000¶

Drawing

In [ ]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                print(cols)
                break
# Next, find the genotypes for sample1

Isolate the column that contains the genotype for sample1¶

Drawing

In [ ]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                geno_col = cols[9]
                print(geno_col)
                break
                    
# Next, extract the genotypes only

Extract only the genotypes from the column¶

Drawing

In [ ]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                geno_col = cols[9]
                geno = geno_col.split(':')[0]
                print(geno)
                break
                
# Next, find in which positions sample1 has alternate alleles

Check if the genotype contains any alternate alleles¶

Drawing

In [ ]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                geno_col = cols[9]
                geno = geno_col.split(':')[0]
                if geno not in ["0/0"]:
                    print(geno)
#Next, print nicely

Print any variants containing alternate alleles for this sample between specified region¶

In [ ]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                geno_col = cols[9]
                geno = geno_col.split(':')[0]
                if geno not in ["0/0"]:
                    ref = cols[3]
                    alt = cols[4]
                    info = chrom + ":" + str(pos) + "_" + ref + "-" + alt + " has genotype: "+ geno
                    print(info)

Day 2, Exercise 2¶

  • Link: https://python-bioinfo.bioshu.se/exercises.html

Lunch¶

Afternoon session¶


1. Quiz for Day 2¶

  • Link: https://python-bioinfo.bioshu.se/quiz.html

2. Introduction to the course project¶


Break¶

3. Team building¶

Introduction to the course project¶

Background¶

cystic fibrosis

source: mayoclinic.org

Cystic fibrosis (CF)

  • Genetic inherited disease
  • Produces thick and sticky mucus in organs, including lungs and the pancreas
  • Clogs the airways of patients and makes them difficult to breathe
  • No cure available but only symptom management, such as airway clearance

Genomic facts of Cystic Fibrosis¶

  • CF is caused by mutations in Cystic Fibrosis Transmembrane Conductance Regulator (CFTR).
  • The CFTR protein is an ion channel protein, acting like gates in a cell membrane that control the traffic of molecules through the membrane.
  • For regular people, CFTR acts as a gate for chloride ions. When chloride leaves the cell, it carries water with it, which makes mucus less thick.
  • For patients with CF, gene mutations in CFTR prevent this functionality, causing the mucus stays sticky and thick. cystic fibrosis

    source: cff.org

More about the CFTR gene¶

  • CFTR gene is located on chromosome 7 of the human genome
  • Over 1,500 mutations known to cause CF
  • One type of mutations
    • Non-synonymous (with amino acid changing) mutations that generate a premature termination codon (PTC), that further leads to a truncated CFTR protein (shortened length).

Goal of the project¶

Write a python program that:¶

  • Extract the correct CFTR transcript from the human genome
  • Translate it into its corresponding amino acid sequence
  • Determine if one or more patients have a premature stop codon

You will be guided step by step towards the final goal

Data¶

  • Human reference genome
    • Chromosome 7 in fasta format
    • Gene annotations in GTF (Gene Transfer Format) format
  • Genome sequencing data from five patients
    • Chromosome 7 in fasta format

Fasta format¶

plaintext
>MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA

GTF format¶

  • GTF stands for Gene Transfer Format
  • Holds information about gene structure
  • Tab-delimited
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
  1. seqname: The name of the sequence (typically a chromosome).
  2. source: The source of the annotation (e.g., ENSEMBL).
  3. feature: The type of feature (e.g., gene, transcript, exon).
  4. start: The starting position of the feature in the sequence.
  5. end: The ending position of the feature in the sequence.
  6. score: A score between 0 and 1000, or . if not applicable (indicating the reliability of the annotation).
  7. strand: The strand on which the feature is located (+ for the forward strand, - for the reverse strand).
  8. frame: The reading frame, one of '0', '1' or '2', or . if not applicable.
  9. attribute: A list of key-value pairs providing additional information about the feature.

Attributes¶

Some attributes (always semi-colon ';' separated key-value pairs):

  • gene_id: The stable identifier for the gene
  • gene_version: The stable identifier version for the gene
  • gene_name: The official symbol of this gene
  • gene_source: The annotation source for this gene
  • transcript_id: The stable identifier for this transcript
  • transcript_name: The symbold for this transcript derived from the gene name
  • exon_id: The stable identifier for this exon
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
1   havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";

Project page¶

https://python-bioinfo.bioshu.se/project.html